80 Standardisation and Normalisation in ML
80.1 Introduction
In machine learning, it's common to scale the data, especially the predictor (independent) variables, to ensure consistency in how models are fitted and interpreted.
Standardisation and normalisation are two techniques for scaling data. They rescale values in slightly different ways, and the choice between them can significantly affect the performance of many ML algorithms.
80.2 Standardisation
Standardisation (also known as z-scoring) shifts and rescales each variable so that it has a mean of 0 and a standard deviation of 1, by subtracting the variable's mean and dividing by its standard deviation.
It is especially useful for algorithms that work best when features are on a common, roughly Gaussian scale, such as linear regression and logistic regression.
We can use the scale() function in R to standardise our data.
An example for one variable:

# Load necessary library
library(ggplot2)

# Create sample data
set.seed(123)
data <- data.frame(Score = rnorm(100, mean = 50, sd = 10))

# Standardise a vector in the dataset
data$Standardised_Score <- scale(data$Score)

# Plot original vs. standardised data
p1 <- ggplot(data, aes(x = Score)) +
  geom_histogram(binwidth = 1, fill = "blue", alpha = 0.7) +
  ggtitle("Original Data")

p2 <- ggplot(data, aes(x = Standardised_Score)) +
  geom_histogram(binwidth = 0.1, fill = "green", alpha = 0.7) +
  ggtitle("Standardised Data, mean = 0 and sd = 1")

gridExtra::grid.arrange(p1, p2, ncol = 2)
An example where all predictors are scaled:

# Create sample dataframe
set.seed(123)
data <- data.frame(
  Variable1 = rnorm(100, mean = 20, sd = 5),
  Variable2 = runif(100, min = 10, max = 50),
  Variable3 = rnorm(100, mean = 0, sd = 1),
  Variable4 = rbinom(100, size = 10, prob = 0.5),
  Variable5 = runif(100, min = 0, max = 100) # This variable will not be scaled
)

# Scale first four variables by creating another dataframe
data_scaled <- as.data.frame(lapply(data[1:4], scale))

# Add the unscaled Variable5 back into the scaled dataframe
data_scaled$Variable5 <- data$Variable5

# Display scaled data
head(data_scaled)
Variable1 Variable2 Variable3 Variable4 Variable5
1 -0.71304802 -0.8402942 0.8265119 -0.6624077 23.72297
2 -0.35120270 1.6182640 0.8065105 -0.6624077 68.64904
3 1.60854170 0.3917819 0.3391857 -0.6624077 22.58184
4 -0.02179795 0.0984535 -1.0949469 -1.2932721 31.84946
5 0.04259548 -0.2836195 -0.1439886 -0.6624077 17.39838
6 1.77983218 1.3392854 -0.3161629 -0.6624077 80.14296
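A quick check in base R confirms what scale() has done: each scaled column now has a mean of (numerically) 0 and a standard deviation of 1, and the result matches manually subtracting the mean and dividing by the standard deviation.

# Means of the scaled columns are effectively zero, sds are one
round(colMeans(data_scaled[1:4]), 10)
sapply(data_scaled[1:4], sd)

# scale() is equivalent to (x - mean(x)) / sd(x)
manual_z <- (data$Variable1 - mean(data$Variable1)) / sd(data$Variable1)
all.equal(as.numeric(scale(data$Variable1)), manual_z)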
80.3 Normalisation
Normalisation adjusts the scale of the data so that the values of each variable lie between 0 and 1, by subtracting the minimum and dividing by the range (maximum minus minimum).
This is useful for algorithms that compute distances between data points, like K-Nearest Neighbors (KNN) and K-Means clustering.
We can use ‘min-max scaling’ in R to achieve this.
# using the same dataset created above
# normalise data using min-max
data$Normalised_Variable1 <- (data$Variable1 - min(data$Variable1)) /
  (max(data$Variable1) - min(data$Variable1))

# plot original vs. normalised
p3 <- ggplot(data, aes(x = Normalised_Variable1)) +
  geom_histogram(binwidth = 0.02, fill = "red", alpha = 0.7) +
  ggtitle("Normalised Data")

gridExtra::grid.arrange(p1, p3, ncol = 2)
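If several variables need normalising, the same formula can be wrapped in a small helper. The min_max() function below is not from any package; it is a hypothetical convenience wrapper, applied here to the first four variables while leaving Variable5 untouched, mirroring the standardisation example above.

# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Normalise the first four variables; keep Variable5 as it is
data_normalised <- as.data.frame(lapply(data[1:4], min_max))
data_normalised$Variable5 <- data$Variable5

# Each normalised variable now runs from exactly 0 to exactly 1
summary(data_normalised$Variable1)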
80.4 When to use
The choice between standardisation and normalisation depends on the characteristics of your data and the requirements of the algorithm you’re using.
We typically standardise data when dealing with features that have a roughly Gaussian (bell curve) distribution. Standardisation is important for models that assume all features are centred around zero and have variance of the same order (a short PCA sketch follows the list below), such as:
- Linear Regression
- Logistic Regression
- Support Vector Machines
- Principal Component Analysis (PCA)
- Algorithms that compute distances or assume normality
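To illustrate the PCA point, base R's prcomp() can standardise internally via its scale. argument, which is equivalent to running scale() on the predictors first. A minimal sketch using the data frame created earlier:

# PCA on the first four variables, standardising inside prcomp()
pca_scaled <- prcomp(data[1:4], center = TRUE, scale. = TRUE)

# Without scaling, Variable2 (range 10 to 50) dominates the components
pca_raw <- prcomp(data[1:4], center = TRUE, scale. = FALSE)

summary(pca_scaled) # variance is spread more evenly across components
summary(pca_raw)    # the first component is dominated by the widest variable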
Normalisation rescales the data into a range of [0, 1] or [-1, 1]. It is useful when you need the features to lie in a bounded interval.
Normalisation is often the choice for models that are sensitive to the magnitude of values and where you don't assume any specific distribution of the features (see the distance sketch after this list), such as:
- Neural Networks
- k-Nearest Neighbors (k-NN)
- k-Means Clustering
- Situations where you need to maintain zero entries in sparse data (unlike standardisation, whose centring step turns zeros into non-zero values).
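To see why magnitude matters for distance-based methods, compare the pairwise distances between a few observations before and after min-max scaling, reusing the hypothetical min_max() helper from the normalisation section:

# Pairwise distances between the first three observations
raw <- data[1:3, c("Variable1", "Variable5")]
nrm <- as.data.frame(lapply(data[c("Variable1", "Variable5")], min_max))[1:3, ]

dist(raw) # dominated by Variable5, whose 0-100 range dwarfs Variable1
dist(nrm) # after normalisation both variables contribute comparably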
In practice, I’d suggest that you try both methods as part of your exploratory data analysis to determine which scaling technique works better for your specific model and dataset.
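One way to do that is to apply both transformations to the same variable and inspect the resulting summaries; a minimal sketch, assuming the data frame created earlier is still in the workspace:

# Apply both scalings to the same variable
standardised <- as.numeric(scale(data$Variable1))
normalised <- (data$Variable1 - min(data$Variable1)) /
  (max(data$Variable1) - min(data$Variable1))

summary(standardised) # centred on 0, unbounded in general
summary(normalised)   # always bounded between 0 and 1

# Both are linear rescalings, so the ordering of observations is unchanged
cor(standardised, normalised) # exactly 1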